Project - FMST - Credit Card Users Churn Prediction

Background and Context

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas.

As a Data scientist at Thera bank we need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

We need to identify the best possible model that will give the required performance.

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict if the customer is going to churn or not
  3. Optimize the model using appropriate techniques
  4. Generate a set of insights and recommendations that will help the bank

Key Questions

  1. What are the key factors influencing whether a customer will churn or renounce his or her credit card?
  2. Is there a good predictive model so that we can address the customer attrition?
  3. What does the performance assessment look like for such a model?

Data Dictionary

Import necessary libraries and load data

Observations:

  • CLIENTNUM column can be dropped as it is a unique number and does not contain any useful information.
  • There are null or missing values in the data for Education_Level and Marital_Status columns.
  • Both numerical and object data type columns are present in the dataset.

Observations:

  • Education_Level has 1519 missing values.
  • Marital_Status has 749 missing values.
  • Rest of the columns have no null values but we will check if there are any wrong or missing values.

Observations:

  • Customer_Age: Customer age varies from 26 to 73 years. With 75% customers are within 52 and median of 46 years.
  • Dependent_count: Minimum of 0 and maximum of 5 dependents. Data looks good with median of 2 dependents.
  • Months_on_book: Month on book ranges from 13 to 56 months. This range seems practical and makes sense.
  • Total_Relationship_Count: Number of products held by customer are from 1 to 6. Majority of customers have more than one product with median of 4 products.
  • Months_Inactive_12_mon: Inactivity ranges from 2 months to six months. This data looks good.
  • Contacts_Count_12_mon: Number of contacts between customer and bank ranges from 0 to 6 times in last 12 months. This seems reasonable with average of 2 contacts.
  • Credit_Limit: Credit limit of customers ranges from 1438 to 34516. The median credit limit is little over 4500.
  • Total_Revolving_Bal: Total revolving balance is also looking good with min 359 and max 2517 with mean of 1163.
  • Avg_Open_To_Buy: This feature is closely related to credit limit it seems. Max value 34516 and min of 0.
  • Total_Trans_Amt: In last 12 months total transactions amount varies from 510 to 18484 with median of 3899.
  • Total_Trans_Ct: Min of 10 and max of 139. Looks good, it should be related to the Total_Trans_Amt.
  • Total_Amt_Chng_Q4_Q1: Min of 0 and max of 3.4. But it seems most of the customers spent less in Q4 than in Q1.
  • Total_Ct_Chng_Q4_Q1: Similar to amount with min of 0 and max of 3.7. Again it seems most of the customers swipped their credit cards less in Q4 than in Q1.
  • Avg_Utilization_Ratio: Nothing ubnormal with min of 0 and max of 0.999. 75% customers have used only upto 50% of their credit limit.

Let's check the summary of other categorical variables, including the null values if any.

Observations:

  • Attrition_Flag: This is the target variable has only two values. Attrited Customer are far less than Esixting Customer. We have an unbalanced dataset.
  • Gender: Consists of two values with F being most frequent.
  • Education_Level: There are six education levels but we see 1519 about 10% as null. We would impute this with mode value which is Graduate.
  • Marital_Status: Three different values with mode Married. There are 749 null values. We would impute this with mode value which is Married.
  • Income_Category: There are five income categories but it seems there is wrong value 'abc'. Again this will be relaced with mode.
  • Card_Category: Card categories look good with 4 different products. Most of the customers have Blue cards.

Univariate Analysis

Let's check the distribution of each feature and comment on the feature distribution as necessary. For this purpose we define the below functions to analyze the nemerical and categorical type of values effectively.

Analysis of Customer_Age

Analysis of Months_on_book

Observations:

  • Customer Age and Months on book data are normally distributed with central tendency and with few outliers.
  • Outliers for customer age are above 65 years and within 73 years. Which are within acceptable range.
  • Similarly outliers of Months on book are also within acceptable ranges and we do not need to treat them.

Analysis of Total_Revolving_Bal

Observations:

  • Most of the customers have very low Total revolving balance.
  • It seems data distribution is an overlap of multiple distributions, especially towards the lower and higher ends of the distribution causing mean to be smaller than median.
  • There are no outliers in the data.

Analysis of Credit_Limit

Analysis of Avg_Open_To_Buy

Observations:

  • Credit limit and Avg open to buy are smililar and normally distributed but right skewed. Median is much less than mean. Again this could be due to multiple overlapping Gaussians.
  • There are many outliers on the right but it is not uncommon to have credit limit or remaining balance as 35K. Hence these could be valid values in the dataset.
  • During model building, StandardScaler will be applied to reduce the range of values.

Analysis of Total_Ct_Chng_Q4_Q1

Analysis of Total_Amt_Chng_Q4_Q1

Observations:

  • Ratios of transactions count and amount between Q4 and Q1 are also having similar normal distributions with high right skew. Although mean and median are not too far apart.
  • There are many outliers on both sides but it is not uncommon to have 0 or 3.5 times more amount or count between Q4 to Q1. We do not see any need to treat these outliers.

Analysis of Avg_Utilization_Ratio

Observations:

  • Avg utilization ratio is a higly right skewed distribution and median is less than mean.
  • There are no outliers in the data.
  • Most of the customers have very low utilization of their credit limit.

Analysis of Total_Trans_Ct

Analysis of Total_Trans_Amt

Observations:

  • Total Transactions amount and count clearly shows multiple Gaussians, hence we see multiple normal distributions side by side.
  • There are many outliers for the amount on the right side but these could be due to multiple customer data segments.
  • The high values of transactions count (140) and amount (175K) are not uncommon. Hence we do not see any need for outlier treatment.
  • StandardScaler will be applied to reduce the range of values during model building.

Analysis of Dependent_count

Observations:

  • Over 50% of customers have 2 or 3 dependents.
  • About 4% have 5 dependents and 9% are having no dependents.

Analysis of Total_Relationship_Count

Observations:

  • Over 22% of customers have 3 products from the bank.
  • Almost equal number of customers (18-19% around) have 4,5 or 6 products.
  • Only 9% customers are having one product with bank.

Analysis of Months_Inactive_12_mon

Observations:

  • Allmost all customers were inactive between 1 to 6 months of the year.
  • Around 70% of customers were inactive for 2 to 3 months in the year.
  • This is inline with the average utilization ratio, being inactive reduces the utilization ratio.

Analysis of Contacts_Count_12_mon

Observations:

  • Around 65% of customers have 2 or 3 contacts with bank in last 12 months.
  • About 4% have are not in contact with bank.

Analysis of Gender

Observations:

  • Around 53% are Female customers.
  • Around 47% are Male customers.

Analysis of Education_Level

Observations:

  • Close to one third of customers are Graduate.
  • Around 15% are Uneducated
  • About 10% have Post-Graduate and Doctorate degrees.
  • Education_Level had Null values, which will be imputed with Graduate

Analysis of Marital_Status

Observations:

  • Majority of the customers are Married followed by Single.
  • A little more than 7% are Divorced.
  • Marital_Status had Null values, which will be imputed with Married

Analysis of Income_Category

Observations:

  • Customers earning less than 40K stands out at around 35%
  • It seems there is a missing range from 80K to 120K

There is an invalid category 'abc' which can be those missing ranges but let's check a sample of the data where Income_Category is 'abc'.

Looking at the sample data it seems there is no pattern as such. We see Income_Category 'abc' has different Education_Level, Credit_Limit or Total_Trans_Amt etc. We cannot conclude with 100% confidence that 'abc' is the missing ranges. Hence due to lack of other information 'abc' will be imputed with mode value.

Analysis of Card_Category

Observations:

  • Over 93% of customers have the Blue card. This seems very popular card with customers.
  • There are not many customers for high end credit cards such as Gold or Platimum.

Analysis of Attrition_Flag

Observations:

  • This is the target variable. Around 16% customers have closed their card account.
  • Around 84% customers are in good standing.
  • From modelling perspective, we have an unbalanced classification.

Multi-Variate Analysis

Distribution of Customer Attrition with Non-Categorical features

Ignoring the outliers let's check the distribution of Attrited Customers based on the non-categorical features

Observations:

  • Customer_Age and Months on book: Existing or Attrited customers with regards to these features are mostly same, although Attrited customers have little higher median.
  • Months_Inactive_12_mon: Customers with 2 to 3 months of inactivity have churned more than other customers.
  • Contacts_Count_12_mon: Customers who have given up their account have more contact points with Bank in general than Existing customers.
  • Credit_Limit and Avg_Open_To_Buy: Customers with comparatively lower credit limit or Avg_Open_To_Buy have attrited more than the existing customers.
  • Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1: Customers with lower ratios of Transactions count and amount with Q4 and Q1 have dropped off more.
  • Total_Trans_Amt and Total_Trans_Ct: Again there is a strong evidence that Customers with lower total transactions count and amount have given up their credit cards, compared to existing customers.
  • Total_Revolving_Bal: Low Total_Revolving_Bal balance seems to be driving factor for customer attrition.
  • Avg_Utilization_Ratio: Attrited customers mostly have lower average utilization ratio.

Distribution of Customer Attrition with Categorical features

Now let's check the distribution of Customer Attrition with regards to categorical values.

Observations:

  • We do not see any significant pattern of existing and attrited customers with regards to Gender, Dependent_count or Marital_Status.
  • Marital_Status has some null values and when we replace them with mode then this observation may change.
  • Education_Level: Doctorate and PGs have little higher percentage of attrited customers but number of samples are very less to conclude any pattern.
  • Income_Category: Customer attrition percentage is almost similar for all Income_Category but since a significant number from 'abc' category are attrited customers when we merge this with Less Than 40K group, it will significantly increase the percentage of that category.
  • Total_Relationship_Count: We see a pattern here, customers with lower products like 1 or 2 have attrited more.
  • Card_Category: Platinum card customers have attrited more but again the number of samples for Platinum cards are less than 1% hence we cannot draw any strong conclusion here.

Correlation between Numerical features

Observations:

  • Avg_Open_To_Buy and Credit_Limit are strongly correlated. During model building we are going to drop Avg_Open_To_Buy.
  • Total_Trans_Amt and Total_Trans_Ct are also higly correlated. Hence Total_Trans_Ct will be dropped.
  • Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 interestingly are not higly correlated hence we will keep both features.
  • Customer_Age and Months_on_book are highly correlated as well.
  • Total_Revolving_Bal and Avg_Utilization_Ratio are also correlated.
  • Credit_Limit and Avg_Utilization_Ratio have a negative correlation, which means customers with higher credit limit has lower average utilization. Like Credit_limit same observation can be made on Avg_Open_To_Buy with regards to Avg_Utilization_Ratio.
  • Total_Relationship_Count has negative correlation with Total_Trans_Amt and Total_Trans_Ct.
  • Rest of the feaures do not show any significant correlation.

The above pairplots show some interesting patterns and we see that attrited customers have taken distinct spaces in some pairplots. Lets elaborate them further in the below charts.

Relation of Total_Trans_Amt with Customer_Age and Gender

Observations:

  • Total_Trans_Amt seems to be an important feature. And we clearly see four layered cake pattern in the data with respect to Total_Trans_Amt. It appears that the dataset is composed of multiple datasets based on Total_Trans_Amt.
  • Regardless of Age, Gender and Education, no customer has attrited who has Total_Trans_Amt above 12000.
  • High concentration of attrited customers are seen below 3000 Total_Trans_Amt and between 4500 to 11000 Total_Trans_Amt.
  • Classification models would probably give very high importance to this feature.

Relation of Total_Revolving_Bal with Customer_Age and Gender

Observations:

  • Again we see high concentration of Attrited customers where Total_Revolving_Bal is lower than 700.
  • It appears that Total_Revolving_Bal will be an important feture in classification models.

Relation of Avg_Utilization_Ratio with Total_Trans_Amt and Marital_Status

Observations:

  • We see that majority of attrited customers are clustered below 3000 Total_Trans_Amt and towards the lower Avg_Utilization_Ratio.
  • A similar second cluster of attrited customers is seen between 7000 to 11000 Total_Trans_Amt and lower Avg_Utilization_Ratio
  • There is also a small cluster of attrited customers around 4000 to 7000 of Total_Trans_Amt and very low Avg_Utilization_Ratio.
  • We expect to see Avg_Utilization_Ratio to be an important feture in classification models.

Relation of Total_Amt_Chng_Q4_Q1 with Total_Ct_Chng_Q4_Q1 and Income_Category

Observations:

  • Majority of attrited customers are concentrated within 1.0 of Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1.
  • We expect to see Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1 to be important fetures in classification models.

Feature Engineering and Missing Value Treatment

Custom Transformers

We noticed that there are certain columns which contains redundant or irrelvant information. We would like to drop these columns as part of pre-processing. In order to drop columns using pipeline later, let's write a custom transformer.

We also need to replace Target variable values to 1 and 0 integer datatype from Object/String data type for machine learning models. Again we are going to write a custom transformer for this purpose as later we can use the same in pipeline building.

Apply Transformations to Original Dataset

Below columns can be dropped without loosing much information for the Model building purpose.

  • CLIENTNUM: Does not contain any useful information.
  • Avg_Open_To_Buy: Strong correlation with Credit_Limit
  • Total_Trans_Ct: High correlation with Total_Trans_Amt

Split data to train, validation and test sets

Before applying other transformation we need to split the data into train, validation and test sets. This will ensure no data leakage from test and validation sets to training data.

Impute Missing Values - Education_Level and Marital_Status

Impute Missing Values - Income_Category

Income_Category 'abc' will be treated as missing value and will be replaced with mode value. During pipeline building we will use another methond using FunctionTransformer to imput this feature.

Model Building

We are going to now build several classification models and evalute their performance based on training, validation and test sets. At the end we'll pick the best model, build a production pipeline and provide recommendations for "The Thera bank" credit card department. We are going to follow below steps:

  1. Build 6 models with original data
  2. Build 6 models with oversampled data
  3. Build 6 models with undersampled data
  4. Choose 3 best models among 18 models built in the previous 3 steps
  5. Tune 3 models
  6. Choose one best model and productionize it using pipelines

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting a person who has churned or attrited but in reality he/she did not.
  2. Predicting a person who has not churned or attrited but in reality he/she did.

Which case is more important?

How to reduce this loss i.e need to reduce False Negatives?

Import necessary libraries for model building

Build 6 models with original data

Build six different models with default settings and use cross validation to get average performance of the models

Check the performance on Validation set

Observations:

  • Default XGBoost classifier is giving a good performance followed by GradientBoostClassifier on validation set.
  • Performance of the models in Validation set is slightly better than training set.
  • Cross Validation is helping models not to overfit.

Build 6 models with oversampled data

Notice that after oversampling Attrited Customers datapoints in Training set has increased from 976 to 5099.

Check the performance on Validation set

Observations:

  • With oversampled synthetic data, models have overfitted in the training set. All models are giving recall score above 0.92 but on validation set we do not see such high recall scores.
  • On Validation set GradientBoosting, XGB and RandomForest have given good performance.
  • Performance of the oversampled models in Validation set is slightly better than models trained on original data.

Build 6 models with undersampled data

Notice that after undersampling Existing Customers datapoints in Training set has decreased from 5099 to 976.

Check the performance on Validation set

Observations:

  • In case of undersampling we noticed that models are giving similar perfromance on Training and Validation sets.
  • XGBoost and GradientBoosting are giving similar performance and other models are not too behind.
  • Models are not overfitted when trained with undersampled data.

Choose 3 best models among 18 models built in the previous 3 steps

Based on the recall score performance top 3 models are:

  1. Undersampled GradientBoosting Classifier
  2. Undersampled XGBoost Classifier
  3. Undersampled Bagging Classifier

Let's check what are the most important features from GradientBoosting and XGBoost classifiers:

Observations:

  • Out of 18 models top 3 models are giving almost similar recall scores 0.93 to 0.94.
  • As expected, all the top models have given very high importance to Total_Trans_Amt
  • The top models also have given high importance to Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1
  • Models have given high importance to Total_Relationship_Count, Customer_Age with varing degree of importance.

Hyperparameter Tuning of top 3 Models

We will tune the top three models using RandomizedSearchCV method based on the project requirement Please note GridSearchCV is not used as only RandomizedSearchCV has been asked.

First let's create two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.

1. Tune the XGBoost Classifier

After tuning, XGBoostClassifier is giving very good performance on the training and validation datasets. The tuned model is giving generalized performance.

2. Tune the GradientBoosting Classifier

After tuning, GradientBoostingClassifier is giving very good performance on the train dataset but somewhat low performance on the validation set.

3. Tune the Bagging Classifier

After tuning, BaggingClassifier is giving very good performance on the train dataset but poor performance on the validation set. The model has overfitted the training data.

Compare Training performance of the top 3 tuned models

Compare Validation performance of the top 3 tuned models

Observations:

  • After hyperparameter tuning, XGBoost classifier has given the top performance for recall score on validation set. And the model is not overfitted.
  • On the other hand, Gradient Boosting and Bagging have very good training performance but somewhat lower performance on validation set. It seems tuned models in this case have actually overfitted the traning data.
  • Based on the analysis so far, we select the XGBoost tuned model as our final model.

Observations:

  • The final XGBoost Tuned model has given highest importance to Total_Trans_Amt.
  • Total_Revolving_Bal, Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1 come next in order of importance.
  • Avg_Utilization_Ratio, Total_Relationship_Count, Credit_Limit, Months_Inactive_12_mon are also important features on deciding if a customer is going to close his/her credit card account.

Performance on Test data of the final tuned model

Observations:

  • The final model is performing well on testing data with a recall score close to 0.97.
  • While the accuracy and precision stands at 0.88 and 0.57 respectively.
  • The model is able to idetify the customers who are going to attrite with 97% confidence!
  • The final model is generalized well since training, validation and test performance are close to each other.

Pipelines for productionizing the model

Column Transformer

Check the steps of the production pipeline

Conclusions

We have analyzed The Thera Bank's Credit Card Users Churn data, build different models using orignal, over sampled and under sampled data. Then we selected the top three models and tuned and finally zeroed on the best model. We have also provided the production pipeline script for training the final model from the original dataset.

While doing the exploratory data analysis and evaluating model outcomes we gathered some valueable insights as below:

Customers can be profiled into three groups:

a) High Risk Customers: Having low transaction amount and counts, low total revolving balance, low transaction amout and count ratios between Q4 and Q1, low average credit utilization. These customers across all ages, marital status, gender, education levels are most likely to give up their credit card account.

b) Medium Risk Customers: These customers have relatively higher transaction amount and counts, total revolving balance, transaction amout and count ratios between Q4 and Q1 but having low relationship counts, low credit limit, inactive for few months, relatively older customers, married with lower education and income.

c) Low Risk Customers: These are good customers for bank and they are not likely to close their accounts. These customers have high transaction amount and counts, high total revolving balance, high transaction amout and count ratios between Q4 and Q1, higher credit limit, have many products with bank, relatively younger customers with high credit limit and higher income.

Recommendations

Based on the analysis and evidences following recommendations can be offered to the credit card business department in order to stop or reduce customer attrition.

  1. Bank needs to find ways to keep customers using their credit card, such that total transaction amount, transactions quarter over quarter, total revolving balance can be increased. Varity of credit card offers related to travel, dinning, entertainment etc. need to be rolled out to its customer base who are profiled as high risk in the previous section.

  2. Along with encouraging higher credit card usage, Bank needs to award more credit limit, cross sell other products to improve the relationship counts, rollout special offers for inactive customers and improve customer service to reduce contact points with bank. This will help customers at the borderline to rethink and stay with bank.

  3. High end cards are not very popular with customer base based on the given dataset, bank needs to market strategically high end cards to high income and young populations as these group of customers are more likely to stay with bank.

  4. On the dataset, it appeared that multiple disparate datasets have been combined to create a large dataset. We may need to analyze the data separately for better prdiction of the customers at risk of closing the account.